Immune Cell Infiltration Across Breast Cancer Subtypes: A Marker Gene Analysis of the METABRIC Cohort
Author
cyy36@cam.ac.uk
Published
February 23, 2026
Cover Page
MSt Healthcare Data Science
Authentication of practice
I confirm that I have fully read and understood the assignment brief for this module.
Y
Details
Name: Chung Yan
Surname: Yu
Submission Date
Word Count: whole assignment including codes
Word Count: main body excluding abstract, references and supplementary materials
Permission to share your assignment
I do give permission to share my assignment with future MSt participants.
University statement of originality
This assignment is the result of my own work and includes nothing which is the outcome of work done in collaboration except as declared in the Preface and specified in the text. It is not substantially the same as any that I have previously submitted for a degree or diploma or other qualification at the University of Cambridge or any other university or similar institution, or that is being concurrently submitted, except as declared in the Preface and specific in the text. I further state that no substantial part of my Portfolio has already been submitted, nor is being concurrently submitted for any such degree, diploma or other qualification at the University of Cambridge or any other university or similar institution except as declared in the Preface and specified in the text.
I confirm the statement of originality as above
Y
Questions for reflection
Self-assessment is an important aspect of feedback literacy, which is, in turn, key to the development of expertise. As you proceed through the MSt Healthcare data science programme, we hope that you will make use of the following prompts to assess your own work on assignments. Specific assignment briefs will likely indicate which of these to address for which assessments, but, in general, we expect you to respond to one or two for each assignment on your course.
For each of the questions, do not spend too long answering – keep it brief. For each question you answer, limit yourself to no more than three items. And please remember, this is optional and developmental: these cover sheets are designed to create space for self-assessment and feedback dialogue, rather than additional assignment workload.
Which aspects of this assignment are you most uncertain about and/or would most like to receive feedback on?
What elements are you left pondering after this assignment that you would like to discuss further?
How have you incorporated feedback from peers and tutors into this assignment?
How, and to what extent, have you been able to incorporate feedback on previous course work into this assignment?
Using the wording in the rubric, how would you describe the quality of the different aspects of your work?
Declaration of the use of generative AI
Which permitted use of generative AI are you acknowledging?
Semantic search of literature and notes, outline creation, output formatting, informed feedback, code generation
Which generative AI tool did you use (name and version)?
Claude Code v2.1.1x (Opus 4.6), M365 CoPilot
What did you use the tool for?
Searching for relevant journal publications (via PubMed MCP), collating notes in my Obsidian vault, .bib file curation, code debugging+generation
How have you used or changed the generative AI’s output
My collated notes always stay in point form so that I write out the paragraphs in my own words. Feedback provided by the GenAI models are weighed and assessed before being acted on.
Abstract
The ability for immune cells to infiltrate the tumour microenvironment (TME) plays a key role in the biological war machine taking on breast cancer.
Introduction
Tumour cells evading the detection and wrath of the immune system is a recognised hallmark of cancer1, and breast cancer is no exception2. The interplay between tumour cells, immune cells and other local factors form the tumour microenvironment (TME), and research in the TME of breast cancers has yielded clinical applications3. The amount of tumour infiltrating immune cells in breast cancers has shown prognostic value4, and this amount and prognostic impact varies across breast cancer’s molecular subtypes5. These subtypes are defined by gene expression6, and are popularised and commercialised as the PAM50 gene expression panel, which has its own prognostic significance7. The subtypes include luminal A, luminal B, HER2-enriched, basal-like and normal-like.
The METABRIC cohort from Curtis et al.8 is one of the larger datasets on breast cancer containing clinical, gene expression and mutation data of 2,509 patients. METABRIC’s analysis has already shown immune infiltration in one of their integrative clusters (IntClust4), showing deletions in T cell receptors and mRNA amplification for immune cell activity, but stopped short of fully characterising the immune cells involved. Jiang et al.9 characterised immune cell infiltration in the METABRIC cohort using CIBERSORT10, developing new immunotypes with prognostic significance. However, CIBERSORT is more of a black box model, a computational approach using deconvolution to estimate cell proportions from RNA mixtures. Therefore there exists a need to use a transparent approach to score immune cell infiltration in METABRIC’s breast tumour samples, and scrutinise any prognostic value from such scoring.
A panel of immune cell marker genes developed by Danaher et al.11 were used to score immune infiltration in the METABRIC cohort. The infiltration of different immune cell types as well as their prognostic value were investigated across the PAM50 subtypes. The goal is to characterise the immune landscape across PAM50 subtypes, identify which immune populations associate with survival, and test whether immune cell scores or multi-dimensional immune profiles improve prognostic value beyond subtype classification alone.
Methods
Click here to see how to pull fresh data straight from cBioPortal.
# To pull fresh data from cBioPortal instead of using the included processed files,# delete the data/processed/ folder and re-render this document.source("scripts/00-data-cleaning.R")source("scripts/01-immune-cell-scoring.R")source("scripts/02-task1-exploration.R")
This study makes use of the METABRIC dataset8 available on cBioPortal12–14, including clinical, gene expression and mutation datasets of 2,509 patients. For gene expression, the log2 intensity was used in place of the normalised z-score to preserve the proportions when applying the immune cell scoring.
The principle of avoiding circularity guided two data selection decisions. First, PAM50 subtypes were chosen over IntClust as IntClust subtypes are defined by copy number aberrations that can directly influence immune marker gene expression, and IntClust4 is already characterised by immune infiltration. The clear lack of overlap between the PAM50 and Danaher’s immune cell gene panels also strengthened this choice. Second, the exclusion of Claudin-low patients. Claudin-low is another breast cancer subtype used alongside PAM50 subtypes on cBioPortal’s METABRIC dataset, and is characterised by immune infiltration15. Fougner et al.15 further demonstrated that Claudin-low is not an independent subtype in addition to the PAM50 subtypes, therefore patients of this category (n = 218) were removed.
Immune cell scoring is done by taking the mean log2 gene expression of the cell types’ marker genes, originally defined by Danaher et al.11 and tabulated in Table S1. Of the 60 marker genes from the Danaher panel, only 58 genes have expression data in the METABRIC dataset. TPSB2 is missing for mast cells and XCL2 for NK cells. The CD4 T cell score is derived by subtracting the CD8 T cell score from the overall T cell score. The Danaher scoring method was chosen for its simplicity and marker genes validated against flow cytometry, in contrast to deconvolution approaches previously applied to this cohort9.
Verifying the immune marker gene coverage on the Illumina HT-12 v3 microarray of METABRIC dataset.
Immune variation across PAM50 subtypes was assessed by Kruskal-Wallis test with Wilcoxon for pairwise comparisons. Prognostic associations were evaluated using Cox proportional hazards models, first univariately and then multivariately to adjust for PAM50 subtypes, along with Kaplan-Meier curves with log-rank tests for visualisation. To develop multi-dimensional immune profiles, clusters were fitted via k-means with silhouette scoring. All p-values from multiple testing were corrected via Benjamini-Hochberg false discovery rate (FDR) and reported as FDR throughout. Overall survival was used for prognosis as the disease-specific survival data used by Curtis et al.8 was not available on cBioPortal.
The whole analysis is stored in this GitHub repository, including the environment.yml tracking the packages used, including ComplexHeatmap16,17, survival18,19, survminer20 and cBioPortalData21.
Results
Cohort Summary and Mutation Landscape
NoteFigure 1 — Data availability across modalities
Intersection plot showing patient counts for each combination of available data modalities (N = 2,509).
From the n = 2509 METABRIC cohort, 529 patients were without expression data (Figure 1). These were significantly younger than the 1980 with expression data (median age 58 vs 61.8 years; Wilcoxon rank-sum p < 0.001). The further removal of patients with Claudin-low (n=218) and NC (Not Classified, n=6) left 1756 patients for further analysis.
Characteristic
Overall N = 1756 (100%)
Basal-like N = 209 (12%)
HER2-Enriched N = 224 (13%)
Luminal A N = 700 (40%)
Luminal B N = 475 (27%)
Normal-like N = 148 (8.4%)
p-value
Age at diagnosis
62 (52, 71)
54 (44, 65)
59 (50, 69)
63 (53, 72)
67 (59, 74)
57 (46, 66)
<0.001
ER status
<0.001
Negative
339 (20%)
172 (84%)
115 (53%)
13 (1.9%)
11 (2.4%)
28 (19%)
Positive
1,386 (80%)
33 (16%)
103 (47%)
677 (98%)
456 (98%)
117 (81%)
Unknown
31
4
6
10
8
3
Tumour Grade
<0.001
1
154 (9.2%)
2 (1.0%)
4 (1.9%)
118 (18%)
18 (3.9%)
12 (8.6%)
2
714 (42%)
17 (8.3%)
54 (25%)
378 (57%)
186 (41%)
79 (57%)
3
815 (48%)
187 (91%)
155 (73%)
172 (26%)
253 (55%)
48 (35%)
Unknown
73
3
11
32
18
9
Cellularity
<0.001
High
898 (52%)
124 (60%)
125 (59%)
326 (47%)
292 (63%)
31 (23%)
Low
163 (9.5%)
28 (14%)
15 (7.0%)
53 (7.7%)
22 (4.7%)
45 (34%)
Moderate
650 (38%)
54 (26%)
73 (34%)
312 (45%)
153 (33%)
58 (43%)
Unknown
45
3
11
9
8
14
Follow-up (months)
116 (60, 184)
86 (33, 183)
97 (39, 172)
130 (82, 197)
104 (55, 164)
121 (64, 176)
<0.001
Deceased
1,044 (59%)
115 (55%)
157 (70%)
378 (54%)
315 (66%)
79 (53%)
<0.001
NoteTable 1 — Cohort characteristics by PAM50 subtype
Baseline characteristics of 1,756 METABRIC patients with expression data, stratified by PAM50 subtype. Continuous variables summarised as median (Q1, Q3); categorical as n (%). P-values from Kruskal-Wallis (continuous) or Pearson’s Chi-squared (categorical) tests.
Of the 1756 patients, the luminal A is the largest group at 40%, with basal-like and HER2-enriched being smaller and also younger. Overall survival is at 41%, with HER2-enriched having the worst survival at 30%.
Median immune cell score differences (mutated minus wild-type) for genes with at least one significant association across 14 cell types. Significance by Wilcoxon rank-sum test (FDR < 0.05); only genes mutated in ≥10 patients tested.
NoteFigure 3 — TP53 mutation and immune infiltration after PAM50 adjustment
PAM50-adjusted effect sizes from linear regression (x-axis) and significance (y-axis) for each immune cell type. Dashed line marks FDR = 0.05.
Immune Infiltration Across PAM50 Subtypes
NoteFigure 4 — Immune cell score heatmap across PAM50 subtypes
Z-scored immune cell scores for 1,756 patients, split and hierarchically clustered by PAM50 subtype. Blue = below-mean; red = above-mean infiltration.
NoteFigure 5 — Immune cell scores by PAM50 subtype
Violin plots with box plots for cell types reaching Kruskal-Wallis FDR < 0.05. Lettering above each subtype plot indicate similarity/difference, subtypes sharing a letter do not differ significantly (pairwise Wilcoxon, FDR < 0.05).
Hazard ratios per SD increase from univariate Cox proportional hazards models for each immune cell type (FDR < 0.05).
Despite reaching FDR significance, the best-performing single cell type (Mast cells) achieved a concordance of only 0.55, indicating modest discriminative ability. Given the strong subtype dependence of immune infiltration observed in Q1, we next asked whether adjusting for PAM50 subtype alters the prognostic landscape.
PAM50-Adjusted Immune Prognostic Analysis
To account for subtype confounding, we fitted multivariable Cox models adjusting for PAM50 subtype for each of the 14 cell types. 4 cell types reached FDR < 0.05 after adjustment (B cells, Cytotoxic cells, CD8 T cells, Mast cells). For these, we additionally examined per-subtype effects.
Cox proportional hazards models adjusted for PAM50 subtype. Forest plot shows the 4 cell types reaching FDR < 0.05; table shows all 14.
Interaction models (immune score × PAM50 subtype) revealed no significant heterogeneity for B cells (p = 0.047, FDR = 0.189), Cytotoxic cells (p = 0.250, FDR = 0.333), CD8 T cells (p = 0.443, FDR = 0.443), Mast cells (p = 0.102, FDR = 0.205), suggesting that while the magnitude of the prognostic effect varies by subtype, the direction is broadly consistent.
Patients split at median score. P-values from log-rank test; titles show PAM50-adjusted HR and FDR.
Aggregating Immune Scores via Clustering
NoteFigure 9 — Immune cluster centroids
Mean z-scored immune cell type profiles for k-means clusters. Left: 2 clusters from all 14 cell types. Right: 2 clusters from the 4 PAM50-adjusted significant cell types only. Clusters labelled by overall immune level. Red = above-average infiltration; blue = below-average.
Model
df
Concordance
AIC
LR p vs PAM50
ΔAIC
PAM50 subtype only
4
0.576
14057.6
—
0.0
PAM50 + B cells
5
0.587
14047.5
<0.001
-10.1
PAM50 + Cytotoxic cells
5
0.582
14050.3
0.002
-7.3
PAM50 + CD8 T cells
5
0.582
14051.8
0.005
-5.8
PAM50 + Mast cells
5
0.584
14052.2
0.007
-5.4
PAM50 + Cluster (4 sig, k=2)
5
0.577
14056.6
0.086
-0.9
PAM50 + Cluster (all 14, k=2)
5
0.576
14058.5
0.314
1.0
NoteTable 4 — Prognostic comparison: PAM50 baseline vs immune-augmented models
LR test p-values against PAM50-only baseline. ΔAIC relative to baseline (negative = better fit).
Even if my day job is somewhat cancer adjacent, diving deep into this METABRIC dataset made me feel like a novice when it came to cancer / breast cancer. But it has been fun seeing what we learn in class come together in more ways than one. The lead authors for the immune cell scoring paper are from the company NanoString Technologies, the same company that has the commercial PAM50 test kit that we learned in class. And in the same paper, they cite the false discovery rate (FDR) correction paper from Benjamini and Hochberg!
This whole assignment has been a rollercoaster ride, and there were so many rabbit holes that this topic would lead you into. For example Danaher didn’t make it easy for me when they tweaked the TIL acronym, usually meaning tumour infiltrating lymphocytes, to sneak in leukocytes in the palce of lymphocytes, so that the immune cells of myeloid origin could be included in the paper. The inclusion of Claudin-low